# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")
FINAL TEST SCORE: $R^2 = 0.39$
Lab 4: Putting it all together in a mini project
For this lab, you can choose to work alone or in a group of up to four students. You are in charge of how you want to work and who you want to work with. Maybe you really want to go through all the steps of the ML process yourself, or maybe you want to practice your collaboration skills; it is up to you! Just remember to indicate who your group members are (if any) when you submit on Gradescope. If you choose to work in a group, you only need to use one GitHub repo (you can create one on github.ubc.ca and set the visibility to "public").
Submission instructions
rubric={mechanics}
You receive marks for submitting your lab correctly; please follow these instructions:
- Follow the general lab instructions.
- Click here to view a description of the rubrics used to grade the questions
- Make at least three commits.
- Push your .ipynb file to your GitHub repository for this lab and upload it to Gradescope.
- Before submitting, make sure you restart the kernel and rerun all cells.
- Also upload a .pdf export of the notebook to facilitate grading of manual questions (preferably WebPDF; you can select two files when uploading to Gradescope).
- Don't change any variable names that are given to you, don't move cells around, and don't include any code to install packages in the notebook.
- The data you download for this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY (there is also a .gitignore in the repo to prevent this).
- Include a clickable link to your GitHub repo for the lab just below this cell.
- It should look something like this https://github.ubc.ca/MDS-2020-21/DSCI_531_labX_yourcwl.
Points: 2
Introduction
In this lab you will be working on an open-ended mini-project, where you will put all the different things you have learned so far in 571 and 573 together to solve an interesting problem.
A few notes and tips when you work on this mini-project:
Tips
- Since this mini-project is open-ended there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary.
- Do not include everything you ever tried in your submission -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code.
- If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to being successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution.
Assessment
We don't have some secret target score that you need to achieve to get a good grade. You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results. For example, if you just have a bunch of code and no text or figures, that's not good. If you instead do a bunch of sane things and you have clearly motivated your choices, but still get lower model performance than your friend, don't sweat it.
A final note
Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "several hours" but not "many hours" is a good guideline for a high-quality submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and we hope you enjoy it as well.
1. Pick your problem and explain the prediction problem
rubric={reasoning}
In this mini project, you will pick one of the following problems:
- A classification problem of predicting whether a credit card client will default or not. For this problem, you will use Default of Credit Card Clients Dataset. In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with the associated research paper, which is available through the UBC library.
OR
- A regression problem of predicting reviews_per_month, as a proxy for the popularity of the listing, using the New York City Airbnb listings from 2019 dataset. Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help hosts create more appealing listings. In reality, they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.
Your tasks:
- Spend some time understanding the problem and what each feature means. Write a few sentences on your initial thoughts on the problem and the dataset.
- Download the dataset and read it as a pandas dataframe.
- Carry out any preliminary preprocessing, if needed (e.g., changing feature names, handling of NaN values etc.)
Points: 3
New York City Airbnb Listings Problem
- Initially, this problem seems like it could be challenging. There are potential features with collinearity that need to be considered, along with related categorical columns like neighbourhood_group and neighbourhood. Additionally, there are several null values in the target column to explore further.
#imports
import sklearn
import numpy as np
import pandas as pd
import shap
import matplotlib.pyplot as plt
import seaborn as sns
import altair_ally as aly
import altair as alt
from xgboost import XGBRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import (
OneHotEncoder,
StandardScaler,
)
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Lasso
from sklearn.ensemble import StackingRegressor
from sklearn.svm import SVR
data = pd.read_csv('data/AB_NYC_2019.csv')
data.head()
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
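As a preliminary preprocessing step, the NaN values in the target column noted above can be quantified and handled before modelling. The sketch below uses toy data (only the column names match the dataset), and dropping rows with a missing target is one defensible choice rather than the only one:

```python
import pandas as pd

# Toy rows mirroring the structure above: reviews_per_month is NaN
# exactly when a listing has never been reviewed.
toy = pd.DataFrame({
    "number_of_reviews": [9, 0, 270, 0],
    "reviews_per_month": [0.21, None, 4.64, None],
})

n_missing = int(toy["reviews_per_month"].isna().sum())

# Rows with a missing target cannot supervise a regression on
# reviews_per_month, so one option is to drop them before splitting.
toy_clean = toy.dropna(subset=["reviews_per_month"])
```

An alternative would be to impute the target (e.g., with 0, since these listings have no reviews), but that changes the meaning of the prediction problem and should be justified explicitly.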
2. Data splitting
rubric={reasoning}
Your tasks:
- Split the data into train and test portions.
Make the decision on the test_size based on the capacity of your laptop.
Points: 1
Given the large size of the dataset, only 60% of it is used for training the model, while testing is done on the remaining 40%.
train_df, test_df = train_test_split(data, test_size = 0.4, random_state=123)
train_df.head()
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17877 | 14010200 | Scandinavean design in Crown Heights, BK | 1683437 | Boram | Brooklyn | Crown Heights | 40.66477 | -73.95060 | Entire home/apt | 89 | 5 | 9 | 2018-05-28 | 0.25 | 1 | 0 |
| 14638 | 11563821 | Private bedroom located in the heart of Chelsea | 10307134 | Anna | Manhattan | Chelsea | 40.74118 | -74.00012 | Private room | 110 | 1 | 48 | 2019-06-16 | 1.80 | 2 | 67 |
| 7479 | 5579629 | Lovely sunlit room in Brooklyn | 329917 | Clémentine | Brooklyn | Greenpoint | 40.72905 | -73.95755 | Private room | 53 | 2 | 5 | 2016-10-21 | 0.13 | 1 | 0 |
| 47058 | 35575853 | Great view, 1 BR right next to Central Park! | 35965489 | Meygan | Manhattan | East Harlem | 40.79755 | -73.94797 | Private room | 100 | 2 | 0 | NaN | NaN | 1 | 7 |
| 9769 | 7509362 | Great BIG Upper West Side Apartment | 29156329 | Andrew | Manhattan | Upper West Side | 40.80120 | -73.96382 | Private room | 87 | 3 | 9 | 2018-09-19 | 0.19 | 1 | 0 |
3. EDA
rubric={viz,reasoning}
Perform exploratory data analysis on the train set.
Your tasks:
- Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
- Summarize your initial observations about the data.
- Pick appropriate metric/metrics for assessment.
Points: 6
Summary Statistics:
The describe function provides key statistics, including the mean, min, max, and standard deviation, which give an overview of the central tendencies of the data as well as its spread. Also, the nunique function helps identify the features that are unique identifiers, such as id, host_id, and name, which do not help the model learn any patterns.
Visualizations:
The distribution of the numerical features is helpful for understanding the spread of the features and whether there is any skewness in any of them. Also, the correlation of the numerical features helps identify whether there are any highly correlated features that need to be dealt with accordingly.
Metric:
For scoring models, we will primarily use $R^2$ since we will be comparing several regression models. Additional metrics, described in later questions, will be used for feature importance and further analysis.
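As a quick illustration of the chosen metric, scikit-learn's r2_score computes $R^2 = 1 - SS_{res}/SS_{tot}$, where 1.0 is a perfect fit and 0.0 matches a DummyRegressor that always predicts the mean (the values below are illustrative only):

```python
from sklearn.metrics import r2_score

# Tiny hand-made example: predictions close to the true values
# yield an R^2 close to 1.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = r2_score(y_true, y_pred)  # ~0.9486
```

Note that $R^2$ can be negative on the test set when a model fits worse than the mean baseline, which is why the DummyRegressor comparison below is a useful sanity check.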
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 29337 entries, 17877 to 15725
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              29337 non-null  int64
 1   name                            29328 non-null  object
 2   host_id                         29337 non-null  int64
 3   host_name                       29325 non-null  object
 4   neighbourhood_group             29337 non-null  object
 5   neighbourhood                   29337 non-null  object
 6   latitude                        29337 non-null  float64
 7   longitude                       29337 non-null  float64
 8   room_type                       29337 non-null  object
 9   price                           29337 non-null  int64
 10  minimum_nights                  29337 non-null  int64
 11  number_of_reviews               29337 non-null  int64
 12  last_review                     23386 non-null  object
 13  reviews_per_month               23386 non-null  float64
 14  calculated_host_listings_count  29337 non-null  int64
 15  availability_365                29337 non-null  int64
dtypes: float64(3), int64(7), object(6)
memory usage: 3.8+ MB
train_df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 29337.0 | 1.891988e+07 | 1.102155e+07 | 2539.00000 | 9.350729e+06 | 1.951751e+07 | 2.916531e+07 | 3.648561e+07 |
| host_id | 29337.0 | 6.714579e+07 | 7.835404e+07 | 2438.00000 | 7.740184e+06 | 3.071907e+07 | 1.064429e+08 | 2.743213e+08 |
| latitude | 29337.0 | 4.072901e+01 | 5.459419e-02 | 40.50641 | 4.069009e+01 | 4.072314e+01 | 4.076328e+01 | 4.091234e+01 |
| longitude | 29337.0 | -7.395222e+01 | 4.609134e-02 | -74.24442 | -7.398303e+01 | -7.395553e+01 | -7.393643e+01 | -7.371299e+01 |
| price | 29337.0 | 1.509391e+02 | 2.282242e+02 | 0.00000 | 6.900000e+01 | 1.070000e+02 | 1.750000e+02 | 1.000000e+04 |
| minimum_nights | 29337.0 | 7.141971e+00 | 2.227211e+01 | 1.00000 | 1.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.250000e+03 |
| number_of_reviews | 29337.0 | 2.335450e+01 | 4.469248e+01 | 0.00000 | 1.000000e+00 | 5.000000e+00 | 2.300000e+01 | 6.290000e+02 |
| reviews_per_month | 23386.0 | 1.369867e+00 | 1.706732e+00 | 0.01000 | 1.900000e-01 | 7.100000e-01 | 2.010000e+00 | 5.850000e+01 |
| calculated_host_listings_count | 29337.0 | 7.003340e+00 | 3.251162e+01 | 1.00000 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.270000e+02 |
| availability_365 | 29337.0 | 1.128036e+02 | 1.315445e+02 | 0.00000 | 0.000000e+00 | 4.500000e+01 | 2.270000e+02 | 3.650000e+02 |
def missing_zero_values_table(df):
zero_val = (df == 0.00).astype(int).sum(axis=0)
mis_val = df.isnull().sum()
mis_val_percent = 100 * df.isnull().sum() / len(df)
mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
mz_table = mz_table.rename(
columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
mz_table['Data Type'] = df.dtypes
mz_table = mz_table[
mz_table.iloc[:,1] != 0].sort_values(
'% of Total Values', ascending=False).round(1)
print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
"There are " + str(mz_table.shape[0]) +
" columns that have missing values.")
return mz_table
missing_zero_values_table(train_df)
Your selected dataframe has 16 columns and 29337 Rows. There are 4 columns that have missing values.
| Zero Values | Missing Values | % of Total Values | Total Zero Missing Values | % Total Zero Missing Values | Data Type | |
|---|---|---|---|---|---|---|
| last_review | 0 | 5951 | 20.3 | 5951 | 20.3 | object |
| reviews_per_month | 0 | 5951 | 20.3 | 5951 | 20.3 | float64 |
| host_name | 0 | 12 | 0.0 | 12 | 0.0 | object |
| name | 0 | 9 | 0.0 | 9 | 0.0 | object |
train_df.nunique()
id                                29337
name                              28894
host_id                           23964
host_name                          8375
neighbourhood_group                   5
neighbourhood                       217
latitude                          15269
longitude                         12151
room_type                             3
price                               591
minimum_nights                       94
number_of_reviews                   360
last_review                        1632
reviews_per_month                   867
calculated_host_listings_count       47
availability_365                  366
dtype: int64
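Building on the nunique observation above, identifier-like columns can also be flagged programmatically rather than by eye. The sketch below uses a toy frame standing in for train_df (values are made up; which columns to drop remains a judgment call):

```python
import pandas as pd

# Toy frame mimicking a few columns of the listings data.
toy = pd.DataFrame({
    "id": [2539, 2595, 3647],
    "host_id": [2787, 2845, 4632],
    "name": ["apt A", "apt B", "apt C"],
    "price": [149, 225, 149],
})

# Columns where every row has a distinct value behave as row
# identifiers and carry no generalizable signal for the model.
id_like = [c for c in toy.columns if toy[c].nunique() == len(toy)]
features = toy.drop(columns=id_like)
```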
# Distribution of Numerical features
aly.alt.data_transformers.enable('vegafusion')
aly.dist(train_df)
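As a complement to the distribution plots, the correlation check mentioned above can be sketched with pandas alone. This uses toy random data rather than the Airbnb frame, and the column names are only borrowed for illustration:

```python
import numpy as np
import pandas as pd

# Toy numeric data standing in for a few train_df columns.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "price": rng.normal(150, 50, size=200),
    "minimum_nights": rng.integers(1, 30, size=200).astype(float),
    "number_of_reviews": rng.integers(0, 200, size=200).astype(float),
})

# Pairwise Pearson correlations between the numeric features;
# values near +/-1 off the diagonal would flag collinearity.
corr = toy.corr(numeric_only=True)
```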